Abstract

We introduce a parallel GPU implementation of the Simple Linear Iterative Clustering (SLIC) superpixel
segmentation. Using a single graphics card, our implementation achieves speedups of up to 83× over the
standard sequential implementation. Our implementation is fully compatible with the standard sequential
implementation, and the software is available online as open source.
1 Introduction

Superpixels are regions of pixels grouped in some perceptually meaningful way, usually following colour or 
boundary cues. They are designed to produce a simpler and more compact representation for an image, while 
keeping its semantic meaning intact. Superpixel segmentation is used most often as an image preprocessing 
step, with a view towards reducing computational complexity for subsequent processing steps. 

The term superpixel, along with the first notable superpixel algorithm, was introduced by [1]. Many
algorithms followed, using various types of image features, various optimisation strategies and various
implementation techniques. These algorithms have varying specifications and performance requirements. For
example, some algorithms aim to find a fixed number of superpixels, others try to find the minimum possible
number of superpixels by imposing a colour cohesion requirement, while others place emphasis on matching
image boundaries. Sometimes fast processing is required, for example when the superpixel algorithm is used as
a precursor to a tracker. Sometimes superpixels are designed not to under-segment the image, for example when
they are used to condense the image information and serve as the basis of a labelling problem. Regardless of
its design, superpixel segmentation is usually among the first steps in a much longer processing pipeline.
Therefore, we believe that, for any superpixel method to be useful, it must satisfy the following two
requirements:

• it should not decrease the performance of the full processing pipeline; 

• it should be fast. 
Figure 1: Example SLIC superpixel segmentation


The performance requirement is satisfied in many vision applications by superpixels that are compact,
uniform and follow image edges. These requirements motivated the Simple Linear Iterative Clustering
(SLIC) algorithm of [2]. It is simple, efficient and suitable for real-time operation. However, the sequential
CPU implementation of SLIC still needs 300 to 400 ms to segment a single 640×480 image. Reducing the number
of iterations for each clustering pass can make the algorithm faster, at the cost of a decrease in performance.

In this work, we propose a GPU implementation of the SLIC algorithm, using the NVIDIA CUDA framework.
Our implementation is up to 83× faster than the original CPU implementation of [2], making it, to our
knowledge, the fastest superpixel method to date.

Our full source code with a simple example can be downloaded from https://github.com/carlren/gSLICr.
The following sections describe in detail our algorithm and implementation.

2 Simple Linear Iterative Clustering (SLIC) 

The Simple Linear Iterative Clustering (SLIC) algorithm for superpixel segmentation was proposed in [2]. An
example segmentation result is shown in Figure 1.

SLIC uses a k-means-based approach to build local clusters of pixels in the 5D [labxy] space defined by 
the L,a,b values of the CIELAB color space and the x,y pixel coordinates. The CIELAB color space is chosen 
because it is perceptually uniform for a small distance in colour. 
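
To make the colour part of the labxy space concrete, the following is a minimal C++ sketch of a per-pixel RGB to CIELAB conversion. It assumes 8-bit sRGB input and the D65 reference white, which is the textbook conversion; the function name rgb_to_lab is hypothetical and the constants used inside the library may differ.

```cpp
#include <array>
#include <cmath>

// Convert one 8-bit sRGB pixel to CIELAB, returned as {L, a, b}.
// Assumption: standard sRGB primaries and D65 white point (textbook values).
std::array<double, 3> rgb_to_lab(unsigned char r8, unsigned char g8, unsigned char b8)
{
    // Undo the sRGB gamma curve to obtain linear intensities in [0, 1].
    auto linearise = [](double c) {
        return c <= 0.04045 ? c / 12.92 : std::pow((c + 0.055) / 1.055, 2.4);
    };
    const double r = linearise(r8 / 255.0);
    const double g = linearise(g8 / 255.0);
    const double b = linearise(b8 / 255.0);

    // Linear RGB -> CIE XYZ, each channel divided by the D65 reference white.
    const double x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.950456;
    const double y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.0;
    const double z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.088754;

    // Cube-root compression with a linear segment near zero.
    auto f = [](double t) {
        return t > 0.008856 ? std::cbrt(t) : 7.787 * t + 16.0 / 116.0;
    };
    const double fx = f(x), fy = f(y), fz = f(z);

    return {116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)};
}
```

With this sketch, pure white maps to approximately L = 100, a = 0, b = 0 and pure black to L = 0, so Euclidean distances in the result approximate perceived colour differences for nearby colours.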

SLIC takes as input the desired number of approximately equally sized superpixels K. Given an image
with N pixels, the approximate size of each superpixel is therefore N/K pixels. Assuming roughly equally sized
superpixels, there would be a superpixel center at every grid interval $S = \sqrt{N/K}$. Let
$[l_i, a_i, b_i, x_i, y_i]^T$ be the 5D point corresponding to a pixel. Writing the cluster center $C_k$ as
$C_k = [l_k, a_k, b_k, x_k, y_k]^T$, SLIC defines a distance measure as:

$$d_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2}$$

$$d_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2}$$

$$D_s = d_{lab} + \frac{m}{S} d_{xy}$$

where $D_s$ is the sum of the lab distance and the xy plane distance normalized by the grid interval S. The
variable m controls the compactness of superpixels, i.e. the greater the value of m, the more spatial proximity
is emphasized and thus the more compact the cluster becomes.
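
As an illustration of the distance measure, here is a small C++ sketch that computes the grid interval S and the combined distance for one pixel and one cluster center. The names Point5, grid_interval and slic_distance are hypothetical and chosen for this example only.

```cpp
#include <cmath>

// One point in labxy space: CIELAB colour plus pixel coordinates.
struct Point5 { double l, a, b, x, y; };

// Grid interval S = sqrt(N / K) for N pixels and K requested superpixels.
double grid_interval(int num_pixels, int num_superpixels)
{
    return std::sqrt(static_cast<double>(num_pixels) / num_superpixels);
}

// Combined distance: colour distance plus spatial distance weighted by m / S.
double slic_distance(const Point5& center, const Point5& pixel, double m, double S)
{
    const double d_lab = std::sqrt((center.l - pixel.l) * (center.l - pixel.l) +
                                   (center.a - pixel.a) * (center.a - pixel.a) +
                                   (center.b - pixel.b) * (center.b - pixel.b));
    const double d_xy = std::sqrt((center.x - pixel.x) * (center.x - pixel.x) +
                                  (center.y - pixel.y) * (center.y - pixel.y));
    return d_lab + (m / S) * d_xy;
}
```

For a 640×480 image and K = 1200, the grid interval is S = 16. With m = 10, a pixel at colour distance 5 and spatial distance 5 from a center then has a combined distance of 5 + (10/16) · 5 = 8.125; doubling m doubles the weight of the spatial term.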

This distance metric is next used in a standard local k-means algorithm. First, the cluster centers are
perturbed to the lowest gradient position in a local neighborhood. Next, iteratively, the algorithm assigns
the best matching pixels in a local neighborhood to each cluster and computes new center locations. The
process is stopped when the cluster centers stabilise, i.e. when the L1 distance between centers at consecutive
iterations is smaller than a set threshold. Finally, connectivity is enforced to clean up the final superpixel
lattice.

In the next section we detail our GPU implementation of SLIC.
Figure 2: Workflow of gSLICr


3 gSLICr GPU Implementation 

We split our implementation into two parts, as shown in Figure 2: the GPU is responsible for most of the 
processing, with only data acquisition and visualization being left for the CPU. 

The GPU implementation then proceeds as follows: 

• Image space conversion: The RGB input image is converted to CIELAB, using one thread for each pixel.

• Cluster center initialisation: We use one thread per cluster center (i.e. superpixel) to initialise our
superpixel map. This is an $n_{sr} \times n_{sc}$ image which contains, for each entry, center coordinates,
number of associated pixels and colour information. $n_{sr}$ and $n_{sc}$ represent the number of superpixels
per image row and column, respectively.

• Finding the cluster associations: Each pixel in the image determines its closest cluster using
the 5D distance detailed in the previous section. This requires a maximum of nine cluster centers to be
examined and is done using one thread per pixel.

• Updating the cluster center: Here we update each cluster center using the pixels assigned to it. This
process is done in two separate kernels. First, each cluster center must access all pixels associated to it,
within a local neighborhood that is a function of the superpixel size. Here we use
$n_{sr} \times n_{sc} \times n_{bl}$, where $n_{sr}$ and $n_{sc}$ are defined as before, and
$n_{bl}$ = spixel_size × 3 / BLOCK_DIM captures the number of pixels each thread can process, as a function of
superpixel size and thread block dimension (16 in our case). The result is written to an image of size
$n_{sr} \times n_{sc} \times n_{bl}$, upon which we run a reduction step on the third dimension to obtain the
final updated cluster center positions.

• Enforce connectivity: We eliminate stray pixels with two calls of the same kernel, each using one thread
per pixel. The kernel prompts a pixel to change its label to that of the surrounding pixels (in a 2 × 2
neighborhood) if all of them have a different label.
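
The cluster center initialisation and association steps above can be sketched on the CPU as follows. This is a sequential C++ illustration, not the library's CUDA kernels: the body of the inner pixel loop corresponds to the work of one GPU thread, and all names (Pixel, Center, init_centers, assign_pixels) are hypothetical. It assumes centers start in the middle of each S × S grid cell and that a pixel only examines the centers of its own and the eight adjacent grid cells.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pixel  { double l, a, b; };        // CIELAB colour of one pixel
struct Center { double l, a, b, x, y; };  // cluster center in labxy space

// One center in the middle of each S x S grid cell.
// n_sr / n_sc receive the number of superpixels per image row / column.
std::vector<Center> init_centers(const std::vector<Pixel>& img, int width, int height,
                                 int S, int& n_sr, int& n_sc)
{
    n_sr = (width + S - 1) / S;
    n_sc = (height + S - 1) / S;
    std::vector<Center> centers;
    centers.reserve(static_cast<std::size_t>(n_sr) * n_sc);
    for (int cy = 0; cy < n_sc; ++cy) {
        for (int cx = 0; cx < n_sr; ++cx) {
            const int x = std::min(cx * S + S / 2, width - 1);
            const int y = std::min(cy * S + S / 2, height - 1);
            const Pixel& p = img[y * width + x];
            centers.push_back(Center{p.l, p.a, p.b, double(x), double(y)});
        }
    }
    return centers;
}

// Each pixel picks the closest of at most nine neighbouring centers.
std::vector<int> assign_pixels(const std::vector<Pixel>& img, int width, int height,
                               int S, double m, const std::vector<Center>& centers,
                               int n_sr, int n_sc)
{
    std::vector<int> labels(img.size(), -1);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {       // one GPU thread per pixel
            const Pixel& p = img[y * width + x];
            const int gx = x / S, gy = y / S;   // grid cell containing the pixel
            double best = std::numeric_limits<double>::max();
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int cx = gx + dx, cy = gy + dy;
                    if (cx < 0 || cy < 0 || cx >= n_sr || cy >= n_sc) continue;
                    const Center& c = centers[cy * n_sr + cx];
                    const double d_lab = std::sqrt((c.l - p.l) * (c.l - p.l) +
                                                   (c.a - p.a) * (c.a - p.a) +
                                                   (c.b - p.b) * (c.b - p.b));
                    const double d_xy = std::sqrt((c.x - x) * (c.x - x) +
                                                  (c.y - y) * (c.y - y));
                    const double d = d_lab + (m / S) * d_xy;
                    if (d < best) { best = d; labels[y * width + x] = cy * n_sr + cx; }
                }
            }
        }
    }
    return labels;
}
```

Because every pixel reads only a bounded number of centers and writes only its own label, the assignment has no write conflicts, which is what makes a one-thread-per-pixel mapping straightforward.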


4 Library Usage 

Our full code can be downloaded from https://github.com/carlren/gSLICr. It consists of (i) a
demo project (which requires OpenCV) and (ii) a separate library (which has no dependencies).

The demo project acquires images from the camera, processes them through the library and displays the
result in an OpenCV window. It creates an instance of the core_engine class, which is the main access
point to our code. This controls the segmentation code in the seg_engine class and times the result.

The seg_engine class is responsible for all the superpixel processing, and the algorithm is controlled from 
the Perform_Segmentation method. This code is listed below: 
Figure 3: Cross Device Engine Design Pattern


    void seg_engine::Perform_Segmentation(UChar4Image* in_img)
    {
        // Copy the input image to the GPU and convert its colour space.
        source_img->SetFrom(in_img, ORUtils::MemoryBlock<Vector4u>::CPU_TO_CUDA);
        Cvt_Img_Space(source_img, cvt_img, gSLICr_settings.color_space);

        Init_Cluster_Centers();
        Find_Center_Association();

        // Alternate center updates and pixel assignments for a fixed number of iterations.
        for (int i = 0; i < gSLICr_settings.no_iters; i++)
        {
            Update_Cluster_Center();
            Find_Center_Association();
        }

        if (gSLICr_settings.do_enforce_connectivity) Enforce_Connectivity();
        cudaThreadSynchronize();
    }

Similar to the other projects from the Oxford Vision Library, such as LibISR [3] or InfiniTAM [4], this
class follows our cross device engine design pattern outlined in Figure 3. The engine is split into 3 layers.
The topmost, so-called Abstract Layer, contains the main algorithm function calls (listed above). The abstract
interface is implemented in the next, Device Specific Layer, which may be very different between e.g. a CPU
and a GPU implementation. Further implementations using e.g. OpenMP or other hardware acceleration
architectures are possible. We only provide a GPU implementation in this case, in the
gSLICr_seg_engine_GPU.h and gSLICr_seg_engine_GPU.cu files. At the third, Device Agnostic Layer, there is
some inline C code that may be called from the higher layers. This contains the bulk of the per-pixel and
per-cluster processing code and can be found in the gSLICr_seg_engine_shared.h file.
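
The three-layer split can be illustrated with a toy C++ sketch. It is not taken from the library: the class and function names (toy_engine, toy_engine_cpu, scale_and_offset) are hypothetical, and a plain CPU back end stands in for the device specific layer so that the example runs without CUDA.

```cpp
#include <cstddef>
#include <vector>

// Device Agnostic Layer: inline per-element code usable from any back end.
// (Hypothetical helper; in a CUDA build it would typically also be marked
// __host__ __device__ so that kernels can call it.)
inline float scale_and_offset(float v, float gain, float offset)
{
    return v * gain + offset;
}

// Abstract Layer: owns the control flow of the algorithm, no device code.
class toy_engine
{
public:
    virtual ~toy_engine() = default;

    void process(std::vector<float>& data)
    {
        upload(data);
        transform(2.0f, 1.0f);
        download(data);
    }

protected:
    virtual void upload(const std::vector<float>& data) = 0;
    virtual void transform(float gain, float offset) = 0;
    virtual void download(std::vector<float>& data) const = 0;
};

// Device Specific Layer: decides where data lives and how the work is looped.
class toy_engine_cpu : public toy_engine
{
protected:
    void upload(const std::vector<float>& data) override { buffer_ = data; }

    void transform(float gain, float offset) override
    {
        // A GPU back end would launch a kernel here, one thread per element.
        for (std::size_t i = 0; i < buffer_.size(); ++i)
            buffer_[i] = scale_and_offset(buffer_[i], gain, offset);
    }

    void download(std::vector<float>& data) const override { data = buffer_; }

private:
    std::vector<float> buffer_;
};
```

The point of the split is that a new back end only has to reimplement the middle layer: the call sequence in process and the per-element arithmetic are shared unchanged.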

5 Results 

Our implementation is designed to produce the same result as the sequential SLIC implementation of [2], so
qualitatively and quantitatively the results are virtually identical. Our method is, however, considerably
faster than this and, to our knowledge, than all other superpixel segmentation techniques. We include a
comparison with many such techniques in Table 1. Here we used five image sizes, and our method was
consistently much faster than any other approach. Compared to the original SLIC algorithm, our approach is up
to 83× faster.
Method            Setting          1024×1024   3631×3859   963×1024    1002×1002   933×800

Current Work      1000 spx         0.01        0.12        0.01        0.01        0.008
Current Work      2000 spx         0.01        0.12        0.01        0.01        0.008
Achanta [2]       1000 spx         0.73        9.69        0.75        0.70        0.55
Achanta [2]       2000 spx         0.74        9.88        0.72        0.71        0.55
Veksler [5]       patch size 25    22.6        321         20.6        19.3        16.7
Veksler [5]       patch size 50    25.2        368         21.6        19.7        17.7
Zhang [6]         patch size 40    0.63        7.49        0.49        0.50        0.36
Zhang [6]         patch size 60    0.60        7.53        0.48        0.50        0.36
Comaniciu [7]     sp. bandw. 11    12.1        X           12.3        9.73        9.47
Comaniciu [7]     sp. bandw. 23    43.3        X           53.8        32.5        34.2
Shi [1]           1000 spx         329         X           316         334         218
Shi [1]           2000 spx         515         X           384         411         338
Felzenszwalb [8]  σ = 0.4          0.62        17.4        0.96        0.85        0.41
Felzenszwalb [8]  σ = 0.5          0.70        18.3        0.97        0.92        0.47
Liu [9]           500 spx          6.76        116         7.34        7.00        5.34
Liu [9]           1000 spx         6.94        116         6.90        7.10        5.35
Liu [9]           2000 spx         7.26        120         7.23        7.60        5.62
den Bergh [10]    200 spx          1.53        22.1        1.67        1.39        1.01
den Bergh [10]    400 spx          2.36        31.2        2.35        2.00        1.42
Levinshtein [11]  1000 spx         38.7        X           46.6        52.1        26.9
Levinshtein [11]  2000 spx         38.4        X           44.0        55.6        26.8
Moore [12]        bounds [13]      2.07        X           2.20        1.79        1.37
Moore [12]        bounds [14]      2.44        X           2.12        2.17        1.25

Table 1: Timing results for the tested methods. 